{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Lab 5 - Mean, Median, and Mode\n", "\n", "In this lab, we will learn some different ways to measure the *central tendency* of data: the mean, median, and mode. \n", "\n", "The [Museum of Modern Art (MoMA)](https://www.moma.org) has uploaded data about the artwork and artists in its collection to GitHub: [https://github.com/MuseumofModernArt/collection](https://github.com/MuseumofModernArt/collection). We will work with the artwork dataset, [Artworks.csv](https://github.com/MuseumofModernArt/collection/blob/master/Artworks.csv). Download this dataset and upload it to your Jupyter Hub.\n", "\n", "As usual, we will import the matplotlib and pandas packages, and set plots to appear in the Jupyter notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the data from the CSV file into a dataframe called `art`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the dataframe was created properly by displaying it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which columns contain categorical data?\n", "\n", "The *mode* is the data value that appears most frequently. It is the main way that central tendency is measured for categorical data. We will compute the mode of the `Medium` column, which records the material(s) the artwork is made of.\n", "\n", "One way to find the mode is to compute the value counts and look at the top value. Try this below. We computed the value counts in Lab 3." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " medium_counts = art[\"Medium\"].value_counts()\n", "medium_counts\n", "
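The two approaches to the mode can be compared on a small example. This is only a sketch: the series `medium` below is hypothetical toy data standing in for the `Medium` column, not values taken from the MoMA dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the Medium column (an assumption for illustration).
medium = pd.Series(['Lithograph', 'Oil on canvas', 'Lithograph', 'Etching', 'Lithograph'])

# value_counts() sorts from most to least frequent,
# so the first index label is the most common value.
counts = medium.value_counts()
top_from_counts = counts.index[0]

# mode() returns a Series, because several values can tie for most frequent.
# Here there is no tie, so we take the first (and only) entry.
top_from_mode = medium.mode()[0]

print(top_from_counts, top_from_mode)
```

If two values were tied for most frequent, `mode()` would return both of them, while the top row of the value counts would show only one.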
\n", "\n", "What is the mode? Does this make sense? (Google the material if you are unfamiliar with it.)\n", "\n", "Another way to compute the mode is to use the function `mode()`. Type the code `art[\"Medium\"].mode()` below and run it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Did you get the same mode as with the value counts?\n", "\n", "Next we will look at two measures of central tendency for quantitative data: the mean and the median. We will use the height of the artwork as the quantitative data. Look at the dataframe to see what the height column is called, and write code to display that column below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Pattern:\n", " dataframe_name[\"column_name\"]\n", "
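As a rough illustration of this pattern, the sketch below selects one column from a made-up dataframe. The name `toy` and its columns are assumptions for the example only, so you still need to look up the real column name in `art`.

```python
import pandas as pd

# Hypothetical toy dataframe; the names and values are made up for illustration.
toy = pd.DataFrame({'Title': ['A', 'B', 'C'], 'Depth (cm)': [30.5, 120.0, 45.2]})

# Square brackets with the column name in quotes select that column as a Series.
# The name must match the column header exactly, including spaces and parentheses.
depths = toy['Depth (cm)']
print(depths)
```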
\n", "\n", "Plot the histogram of the heights below to see the distribution of this (statistical) variable. We learned how to make histograms in Lab 4." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " art[\"Height (cm)\"].hist(bins = 80)\n", "
\n", "\n", "What do you notice about this distribution? What do you think causes this?\n", "\n", "Extreme data values are called *outliers*. Let's look at the artwork with the greatest height. We can get the maximum data value with the function `max()`. To get the maximum height, type the code `art[\"Height (cm)\"].max()` below and run it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you think MoMA has an artwork that is 91m high? Let's investigate further. To find the row with this maximum height, we can use the function `idxmax()`, which finds the row index of the maximum. We will save this in the variable `max_row` with the code `max_row = art[\"Height (cm)\"].idxmax()`. Type and run it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which row holds the artwork with the maximum height?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To display the full row of data about this artwork, we use the code `.iloc[]`. `iloc` is *not* a function because it is followed by square brackets instead of parentheses, but it works similarly to one. It selects a row by its position, which matches the index returned by `idxmax()` provided the dataframe uses the default 0, 1, 2, ... index. Type and run the code `art.iloc[max_row]` below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you notice? Is the artwork actually 91.4m high?\n", "\n", "In this case, the large height is an error. To be able to plot a more informative histogram, we are going to remove all artwork with heights that are outliers. We will do this by filtering the data with Python.\n", "\n", "We will keep only artworks with heights less than 500cm. First, we build a filter that marks which rows meet this condition. Run the following code. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "height_filter = art[\"Height (cm)\"] < 500\n", "height_filter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you think is stored in the variable `height_filter`?\n", "\n", "We can now create a new dataframe, containing only the rows marked as `True` in `height_filter`. Type and run the code `new_art = art[height_filter]` below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To test that this worked, plot the histogram of the height column in the new dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute the mean height of artwork in our filtered dataframe with the code `new_art[\"Height (cm)\"].mean()`. Try it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute the median height of the artwork in our filtered dataframe with the code `new_art[\"Height (cm)\"].median()`. Try it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now compute the mean and median heights of artwork in the original dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Did the mean or median change? Why/why not?\n", "\n", "#### Challenges:\n", "- Plot a histogram of the width of the artworks. 
Filter it and compare the mean and median of the filtered and unfiltered data.\n", "- Compute the mean and median green taxi trip distances.\n", "- Compute the mode of the green taxi trip payment type." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }